Introduction

This notebook validates the best model identified in previous work. We consider two applications:

  • Balance Beta and Release on the other covariates (e.g., environment and performance metrics), then look at the difference in user engagement metrics between the balanced Beta and Release for that version (N). This shows how closely Beta clients with similar environments and performance resemble Release in terms of usage.
  • Balance the Beta and Release datasets so that they resemble each other across the covariates we are concerned with. Balancing, in this case, yields a set of Beta client_id values that resembles Release. We then query the current Beta data (version N+1) for these client_id values and calculate the metrics of interest from the covariates of interest. This is our outcome.

Data Preparation

Training

rows    columns  discrete_columns  continuous_columns  all_missing_columns  total_missing_values  complete_rows  total_observations  memory_usage
302819  96       7                 89                  0                    0                     302819         29070624            176871312

First Application - Training Dataset (V67)

In this application, we balance the two groups (Beta and Release) on the other covariates (e.g., environment and performance metrics) and then look at the difference in user engagement metrics between the balanced Beta and Release for that version (N). This tells us how Beta differs from Release in user engagement, with all the other covariates being equal.

Modeling

Setting the selected experiment from previous work.

Match using Nearest Neighbor matching: Full Dataset

The best model from previous work.

## 
## Call:
## matchit(formula = generate_formula(covariates, label), data = df_train_1x, 
##     method = "nearest", replace = TRUE)
## 
## Summary of balance for all data:
##                                   Means Treated Means Control SD Control
## distance                                 0.6236        0.3764     0.2021
## daily_num_sessions_started               2.8999        2.3689     2.7099
## daily_num_sessions_started_max           5.2789        4.2814     4.8046
## FX_PAGE_LOAD_MS_2_PARENT              3027.2395     3463.7084  1920.1256
## memory_mb                             9437.1264     8965.1561  7925.6741
## num_active_days                          5.5807        5.3462     2.2644
## num_addons                               5.6489        7.8554     3.3349
## num_bookmarks                          160.3317      242.4878  1292.6145
## profile_age                            896.6628      893.7534   762.4791
## session_length                           9.2420       12.2962    14.7435
## session_length_max                      18.1651       22.7066    30.2272
## TIME_TO_DOM_COMPLETE_MS               3286.9842     4388.9143  4254.1363
## TIME_TO_DOM_CONTENT_LOADED_END_MS     2293.7636     2737.6381  2700.5353
## TIME_TO_DOM_INTERACTIVE_MS            1792.4702     2404.3932  2397.0177
## TIME_TO_LOAD_EVENT_END_MS             3011.2550     4126.9123  4007.0480
## TIME_TO_NON_BLANK_PAINT_MS            1442.6659     1833.6557  2128.4506
##                                    Mean Diff  eQQ Med  eQQ Mean     eQQ Max
## distance                              0.2472   0.2578    0.2472      0.3180
## daily_num_sessions_started            0.5310   0.4250    0.5318      1.6250
## daily_num_sessions_started_max        0.9975   1.0000    0.9994     12.0000
## FX_PAGE_LOAD_MS_2_PARENT           -436.4690 294.4388  436.5274   1248.9389
## memory_mb                           471.9702  27.0000  520.5812 196252.0000
## num_active_days                       0.2345   0.0000    0.2465      1.0000
## num_addons                           -2.2064   2.0000    2.2088    114.0000
## num_bookmarks                       -82.1561   1.0000   82.6463  21769.0000
## profile_age                           2.9094  25.0000   25.5655   1384.0000
## session_length                       -3.0542   1.4295    3.0567    150.2233
## session_length_max                   -4.5415   2.6847    4.5523    935.7919
## TIME_TO_DOM_COMPLETE_MS           -1101.9301 418.7669 1101.9620  15499.6121
## TIME_TO_DOM_CONTENT_LOADED_END_MS  -443.8745 228.7603  443.9028  10881.3696
## TIME_TO_DOM_INTERACTIVE_MS         -611.9231 242.9859  611.9817  22911.2000
## TIME_TO_LOAD_EVENT_END_MS         -1115.6573 438.5717 1115.7014  16697.6975
## TIME_TO_NON_BLANK_PAINT_MS         -390.9897 156.9658  391.0893  29396.6429
## 
## 
## Summary of balance for matched data:
##                                   Means Treated Means Control SD Control
## distance                                 0.6236        0.6236     0.2010
## daily_num_sessions_started               2.8999        4.5150     5.3190
## daily_num_sessions_started_max           5.2789        8.3996     9.9855
## FX_PAGE_LOAD_MS_2_PARENT              3027.2395     2898.3402  1631.7936
## memory_mb                             9437.1264    13929.8965 16819.5954
## num_active_days                          5.5807        5.9742     2.0587
## num_addons                               5.6489        6.2318     1.9625
## num_bookmarks                          160.3317      543.9529  3192.8106
## profile_age                            896.6628      925.0090   785.0875
## session_length                           9.2420        8.3001    13.0964
## session_length_max                      18.1651       19.1759    63.4434
## TIME_TO_DOM_COMPLETE_MS               3286.9842     3144.0420  3117.5267
## TIME_TO_DOM_CONTENT_LOADED_END_MS     2293.7636     2837.8774  3917.2605
## TIME_TO_DOM_INTERACTIVE_MS            1792.4702     1719.1141  1783.4556
## TIME_TO_LOAD_EVENT_END_MS             3011.2550     2798.3931  2634.1006
## TIME_TO_NON_BLANK_PAINT_MS            1442.6659     1396.3843  1694.5964
##                                    Mean Diff  eQQ Med eQQ Mean     eQQ Max
## distance                              0.0000   0.1678   0.1611      0.2054
## daily_num_sessions_started           -1.6151   0.2143   0.2369      2.7500
## daily_num_sessions_started_max       -3.1207   0.0000   0.4142     12.0000
## FX_PAGE_LOAD_MS_2_PARENT            128.8992  98.7674 203.3276    908.3304
## memory_mb                         -4492.7701  23.0000 613.4854 195904.0000
## num_active_days                      -0.3935   0.0000   0.1592      1.0000
## num_addons                           -0.5829   1.5000   1.4790      8.4000
## num_bookmarks                      -383.6211   2.0000  85.8880  21043.6250
## profile_age                         -28.3463  17.0000  23.3302   1484.0000
## session_length                        0.9419   0.1518   1.4165    143.4417
## session_length_max                   -1.0108   0.3483   2.3806    969.3572
## TIME_TO_DOM_COMPLETE_MS             142.9422 147.9143 549.9064   6621.0697
## TIME_TO_DOM_CONTENT_LOADED_END_MS  -544.1138  91.2539 265.8759   4801.4920
## TIME_TO_DOM_INTERACTIVE_MS           73.3560  91.9479 310.3343   4396.0455
## TIME_TO_LOAD_EVENT_END_MS           212.8619 144.2673 548.0156   6756.1084
## TIME_TO_NON_BLANK_PAINT_MS           46.2817  61.4867 197.9214  13681.3200
## 
## Percent Balance Improvement:
##                                   Mean Diff.   eQQ Med eQQ Mean  eQQ Max
## distance                            100.0000   34.9070  34.8278  35.4074
## daily_num_sessions_started         -204.1764   49.5798  55.4625 -69.2308
## daily_num_sessions_started_max     -212.8466  100.0000  58.5504   0.0000
## FX_PAGE_LOAD_MS_2_PARENT             70.4677   66.4557  53.4216  27.2718
## memory_mb                          -851.9181   14.8148 -17.8462   0.1773
## num_active_days                     -67.8061    0.0000  35.4217   0.0000
## num_addons                           73.5837   25.0000  33.0419  92.6316
## num_bookmarks                      -366.9418 -100.0000  -3.9224   3.3321
## profile_age                        -874.2978   32.0000   8.7436  -7.2254
## session_length                       69.1599   89.3797  53.6585   4.5144
## session_length_max                   77.7431   87.0253  47.7048  -3.5868
## TIME_TO_DOM_COMPLETE_MS              87.0280   64.6786  50.0975  57.2824
## TIME_TO_DOM_CONTENT_LOADED_END_MS   -22.5828   60.1094  40.1049  55.8742
## TIME_TO_DOM_INTERACTIVE_MS           88.0122   62.1592  49.2903  80.8127
## TIME_TO_LOAD_EVENT_END_MS            80.9205   67.1052  50.8815  59.5387
## TIME_TO_NON_BLANK_PAINT_MS           88.1629   60.8280  49.3923  53.4596
## 
## Sample sizes:
##           Control Treated
## All         59627   59627
## Matched     21051   59627
## Unmatched   38576       0
## Discarded       0       0
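
The figures in the Percent Balance Improvement table are consistent with the formula 100 * (|before| - |after|) / |before| applied to each balance statistic. The sketch below applies that formula to three mean differences copied from the two summaries. The formula is an assumption inferred from the printed numbers, and the function name is illustrative only.

```python
def percent_balance_improvement(before: float, after: float) -> float:
    """Return 100 * (|before| - |after|) / |before|.

    A negative result means the absolute difference grew after matching.
    """
    if before == 0:
        raise ValueError("undefined when the pre-matching difference is zero")
    return 100.0 * (abs(before) - abs(after)) / abs(before)


# (pre-matching, post-matching) mean differences copied from the summaries.
mean_diffs = {
    "num_addons": (-2.2064, -0.5829),
    "daily_num_sessions_started": (0.5310, -1.6151),
    "memory_mb": (471.9702, -4492.7701),
}

improvement = {
    name: percent_balance_improvement(before, after)
    for name, (before, after) in mean_diffs.items()
}

for name, value in improvement.items():
    print(f"{name}: {value:.2f}")
```

Because the inputs are rounded to four decimals, the results agree with the table only up to rounding. The negative values for daily_num_sessions_started and memory_mb indicate that the absolute mean difference for those covariates is larger after matching than before.
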
##                                                Stratified by label
##                                                 beta             
##   n                                               21051          
##   daily_num_sessions_started (mean (SD))           2.75 (3.17)   
##   daily_num_sessions_started_max (mean (SD))       5.01 (5.67)   
##   FX_PAGE_LOAD_MS_2_PARENT (mean (SD))          3229.24 (1805.07)
##   memory_mb (mean (SD))                         9530.79 (9308.07)
##   num_active_days (mean (SD))                      5.55 (2.19)   
##   num_addons (mean (SD))                           7.12 (2.55)   
##   num_bookmarks (mean (SD))                      244.46 (1473.54)
##   profile_age (mean (SD))                        911.98 (777.07) 
##   session_length (mean (SD))                      10.63 (13.05)  
##   session_length_max (mean (SD))                  20.46 (31.85)  
##   TIME_TO_DOM_COMPLETE_MS (mean (SD))           3836.39 (3693.24)
##   TIME_TO_DOM_CONTENT_LOADED_END_MS (mean (SD)) 2558.47 (2672.28)
##   TIME_TO_DOM_INTERACTIVE_MS (mean (SD))        2101.36 (2058.07)
##   TIME_TO_LOAD_EVENT_END_MS (mean (SD))         3558.01 (3430.90)
##   TIME_TO_NON_BLANK_PAINT_MS (mean (SD))        1640.41 (1881.21)
##                                                Stratified by label
##                                                 release           SMD   
##   n                                               59627                 
##   daily_num_sessions_started (mean (SD))           2.90 (2.94)     0.050
##   daily_num_sessions_started_max (mean (SD))       5.28 (5.31)     0.049
##   FX_PAGE_LOAD_MS_2_PARENT (mean (SD))          3027.24 (1578.89)  0.119
##   memory_mb (mean (SD))                         9437.13 (8683.71)  0.010
##   num_active_days (mean (SD))                      5.58 (2.06)     0.015
##   num_addons (mean (SD))                           5.65 (2.22)     0.615
##   num_bookmarks (mean (SD))                      160.33 (661.20)   0.074
##   profile_age (mean (SD))                        896.66 (771.74)   0.020
##   session_length (mean (SD))                       9.24 (9.47)     0.122
##   session_length_max (mean (SD))                  18.17 (19.46)    0.087
##   TIME_TO_DOM_COMPLETE_MS (mean (SD))           3286.98 (2685.73)  0.170
##   TIME_TO_DOM_CONTENT_LOADED_END_MS (mean (SD)) 2293.76 (2234.97)  0.107
##   TIME_TO_DOM_INTERACTIVE_MS (mean (SD))        1792.47 (1488.20)  0.172
##   TIME_TO_LOAD_EVENT_END_MS (mean (SD))         3011.26 (2427.41)  0.184
##   TIME_TO_NON_BLANK_PAINT_MS (mean (SD))        1442.67 (1380.44)  0.120
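
The matchit call above uses method = "nearest" with replace = TRUE, so one control unit can serve as the match for several treated units. That is consistent with the sample sizes, where 59627 treated units are matched to 21051 distinct controls. A minimal sketch of the idea follows, assuming each unit is summarized by a single scalar distance score. The scores and the function name are hypothetical, and this is not the MatchIt implementation.

```python
from bisect import bisect_left


def nearest_neighbor_match(treated, control):
    """Match each treated score to the closest control score, with replacement.

    Returns one control index per treated unit. The same control index may
    appear more than once because controls are never removed from the pool.
    """
    if not control:
        raise ValueError("control pool must not be empty")
    # Sort control indices by score so the nearest score is found by bisection.
    order = sorted(range(len(control)), key=lambda i: control[i])
    sorted_scores = [control[i] for i in order]

    matches = []
    for score in treated:
        pos = bisect_left(sorted_scores, score)
        # The nearest neighbor is either just below or just above `score`.
        candidates = [p for p in (pos - 1, pos) if 0 <= p < len(sorted_scores)]
        best = min(candidates, key=lambda p: abs(sorted_scores[p] - score))
        matches.append(order[best])
    return matches


# Hypothetical distance scores, for illustration only.
treated_scores = [0.62, 0.64, 0.90, 0.30]
control_scores = [0.10, 0.35, 0.63, 0.95]

matches = nearest_neighbor_match(treated_scores, control_scores)
distinct_controls = set(matches)
print(matches, len(distinct_controls))
```

In this toy example two treated units share the same control, and one control is never used, which mirrors how matching with replacement can leave fewer distinct matched controls than treated units.
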


Observations

  • The Love plot is a summary plot of covariate balance pre- and post-matching. Each point represents the balance statistic for that covariate, colored based on whether it is calculated before or after adjustment. The dotted lines represent the threshold set (\(0.1\)); if most or all of the points after adjustment are within the threshold, that is good evidence that balance has been achieved.
  • The plot shows that, before matching, the standardized mean differences are not particularly large (except for num_addons). After matching, they are smaller still: for most covariates, the absolute value falls below the threshold (\(0.1\)).
  • Therefore, this result is good evidence that the matching worked well, as balance improved on almost all variables after adjustment.
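
For reference, a standardized mean difference can be computed from group means and standard deviations alone. The sketch below assumes the common pooled-SD definition, |mean_a - mean_b| / sqrt((sd_a^2 + sd_b^2) / 2). This is an assumption about how the SMD column in the stratified table was produced, although it does reproduce the printed values for the two covariates used here. The function name is illustrative.

```python
from math import sqrt

THRESHOLD = 0.1  # balance threshold used in the Love plot


def standardized_mean_difference(mean_a, sd_a, mean_b, sd_b):
    """Absolute mean difference divided by the pooled standard deviation."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return abs(mean_a - mean_b) / pooled_sd


# Means and SDs copied from the stratified table (matched Beta vs Release).
smd_addons = standardized_mean_difference(7.12, 2.55, 5.65, 2.22)
smd_memory = standardized_mean_difference(9530.79, 9308.07, 9437.13, 8683.71)

for name, smd in [("num_addons", smd_addons), ("memory_mb", smd_memory)]:
    status = "within" if smd < THRESHOLD else "above"
    print(f"{name}: SMD = {smd:.3f} ({status} threshold)")
```
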

Post-matching Beta-Release Difference

active_hours active_hours_max uri_count uri_count_max search_count search_count_max num_pages num_pages_max daily_max_tabs daily_max_tabs_max daily_unique_domains daily_unique_domains_max daily_tabs_opened daily_tabs_opened_max
beta (mean) 0.7959226 1.5377521 150.2729335 308.9288395 2.3187671 5.4231153 1.632206e+04 1.650547e+04 8.3707078 12.4062515 4.7555407 8.2166331 18.4296738 36.0778110
release (mean) 0.8447582 1.6203453 156.1124914 319.3721301 2.3742188 5.4467272 1.738034e+04 1.757009e+04 6.1059912 9.2342731 4.9637533 8.5511265 16.9665119 33.0464555
delta (mean) 0.0578102 0.0509726 0.0374061 0.0326994 0.0233558 0.0043351 6.088980e-02 6.059230e-02 0.3709007 0.3435006 0.0419466 0.0391169 0.0862382 0.0917301
beta (median) 0.5187500 1.0513889 86.4000000 174.0000000 0.8000000 2.0000000 3.869700e+03 4.032000e+03 3.8333333 6.0000000 3.3571429 5.0000000 8.2857143 16.0000000
release (median) 0.5743056 1.1527778 97.2500000 196.0000000 0.8750000 2.0000000 5.571333e+03 5.746000e+03 3.7142857 6.0000000 3.6000000 6.0000000 8.8333333 17.0000000
delta (median) 0.0967352 0.0879518 0.1115681 0.1122449 0.0857143 0.0000000 3.054266e-01 2.982945e-01 0.0320513 0.0000000 0.0674603 0.1666667 0.0619946 0.0588235
metric label active_hours active_hours_max uri_count uri_count_max search_count search_count_max num_pages num_pages_max daily_max_tabs daily_max_tabs_max daily_unique_domains daily_unique_domains_max daily_tabs_opened daily_tabs_opened_max
mean beta 0.8236611 1.577508 152.74550 311.0213 2.4506498 5.636171 17363.463 17558.93 9.603628 13.811495 5.060464 8.743610 20.491908 39.64786
mean beta - matched 0.7959226 1.537752 150.27293 308.9288 2.3187671 5.423115 16322.057 16505.47 8.370708 12.406251 4.755541 8.216633 18.429674 36.07781
mean release 0.8447582 1.620345 156.11249 319.3721 2.3742188 5.446727 17380.343 17570.09 6.105991 9.234273 4.963753 8.551127 16.966512 33.04646
median beta 0.5309524 1.063889 86.66667 172.0000 0.8333333 2.000000 4185.667 4340.00 4.250000 6.000000 3.562500 5.500000 9.000000 17.00000
median beta - matched 0.5187500 1.051389 86.40000 174.0000 0.8000000 2.000000 3869.700 4032.00 3.833333 6.000000 3.357143 5.000000 8.285714 16.00000
median release 0.5743056 1.152778 97.25000 196.0000 0.8750000 2.000000 5571.333 5746.00 3.714286 6.000000 3.600000 6.000000 8.833333 17.00000
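
The delta rows in the first table are consistent with a relative difference, |beta - release| / release, taken between the matched Beta and Release summaries. A small sketch under that assumption, using values copied from the tables (the function name is illustrative):

```python
def relative_delta(beta: float, release: float) -> float:
    """Absolute difference between Beta and Release, relative to Release."""
    if release == 0:
        raise ValueError("relative delta is undefined when release is zero")
    return abs(beta - release) / abs(release)


# (matched Beta mean, Release mean) copied from the tables.
means = {
    "active_hours": (0.7959226, 0.8447582),
    "daily_max_tabs": (8.3707078, 6.1059912),
}

deltas = {name: relative_delta(b, r) for name, (b, r) in means.items()}

for name, value in deltas.items():
    print(f"{name}: {value:.4f}")
```

Read this way, a delta of about 0.37 for daily_max_tabs means the matched Beta mean differs from the Release mean by roughly 37% of the Release value.
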

Kolmogorov-Smirnov test (KS)

We can use the Kolmogorov-Smirnov (KS) test to quantify the differences between the balanced Beta and Release for version v67. The KS statistic lies between 0 and 1 and measures the largest gap between the empirical cumulative distribution functions of the two samples. Smaller KS values indicate better balance.

KS
active_hours 0.0505550
active_hours_max 0.0457469
uri_count 0.0510276
uri_count_max 0.0522226
search_count 0.0179955
search_count_max 0.0193284
num_pages 0.0673132
num_pages_max 0.0662038
daily_max_tabs 0.0466099
daily_max_tabs_max 0.0434794
daily_unique_domains 0.0472977
daily_unique_domains_max 0.0476850
daily_tabs_opened 0.0331167
daily_tabs_opened_max 0.0378840
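
To make the statistic concrete, the two-sample KS distance is the largest absolute gap between the two empirical cumulative distribution functions. The sketch below is a plain-Python illustration on made-up samples. It is not the code used to produce the table above, and the function name is illustrative.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS distance: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    distance = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x so ties are handled.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        distance = max(distance, abs(i / len(a) - j / len(b)))
    # Once one sample is exhausted its ECDF equals 1, so the gap can only shrink.
    return distance


# Made-up samples, for illustration only.
identical = ks_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
overlapping = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
print(identical, disjoint, overlapping)
```

Identical samples give a distance of 0 and fully separated samples give 1, which is why values close to 0 in the table are read as good balance.
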

Visual inspection

Here, we display density plots for the two groups on the given user engagement metric, so we can visually compare their distribution. The degree to which the densities for the two groups overlap is a good measure of group balance on the given covariate; significant differences in shape can be indicative of poor balance, even when the mean differences and variance ratios are well within thresholds.

The following violin plots depict the distributions of these subsets:

  • Beta v67: pre-matching
  • Beta v67: matched and subsetted
  • Release v67

NOTE: Guiding lines have been added for the following:

  • black solid: Release mean
  • black dashed: Release median
  • red dashed: subsetted Beta mean


Observations

The density and violin plots show that there are significant differences between the two groups (Beta and Release) in some user engagement metrics, listed as follows.

  • num_pages
  • num_pages_max
  • active_hours
  • active_hours_max
  • uri_count
  • uri_count_max
  • daily_unique_domains
  • daily_unique_domains_max

Second Application - Validation dataset (V68)

In this application, we balance the Beta and Release datasets so that they resemble each other across the covariates we are concerned with, that is, the user engagement metrics. Balancing, in this case, yields a set of Beta client_id values that resembles Release. Following these clients into the next version gives us an idea of how their behavior changes over time. If we see changes that are larger than anticipated, then we know that something significant is happening in user engagement that we can “forecast” in the subsequent Release.

First, we determine the number of training (v67) Beta and Release clients that are in the validation set (v68).

##     label  freq
## 1    beta 38861
## 2 release 38184

Let’s compare this to the existing distribution:

## Percentage of beta mutual clients: 65 %
## Percentage of release mutual clients: 64 %

Hence, most training clients (about 65% of Beta and 64% of Release) are also in the validation set.
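
These percentages follow from the counts above and the 59627 clients per group reported in the matching sample sizes. The sketch below shows the arithmetic together with a set-based way to find mutual clients. The client IDs and the function name are hypothetical.

```python
def percent_retained(train_ids, validation_ids):
    """Share of training clients that also appear in the validation set."""
    train = set(train_ids)
    if not train:
        raise ValueError("training set must not be empty")
    return 100.0 * len(train & set(validation_ids)) / len(train)


# Hypothetical client IDs, for illustration only.
train_clients = ["a", "b", "c", "d"]
validation_clients = ["b", "c", "d", "e"]
toy_share = percent_retained(train_clients, validation_clients)

# Counts copied from the outputs above.
TRAIN_PER_GROUP = 59627
beta_share = round(100 * 38861 / TRAIN_PER_GROUP)
release_share = round(100 * 38184 / TRAIN_PER_GROUP)
print(toy_share, beta_share, release_share)
```
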

Holdout Covariates

  • User engagement metrics:
    • active_hours
    • active_hours_max
    • uri_count
    • uri_count_max
    • search_count
    • search_count_max
    • num_pages
    • num_pages_max
    • daily_max_tabs
    • daily_max_tabs_max
    • daily_unique_domains
    • daily_unique_domains_max
    • daily_tabs_opened
    • daily_tabs_opened_max

Subset the validation clients down to those matched:

##     label  freq
## 1    beta 14132
## 2 release 10467

Training and Validation Difference:

Mean

active_hours active_hours_max uri_count uri_count_max search_count search_count_max num_pages num_pages_max daily_max_tabs daily_max_tabs_max daily_unique_domains daily_unique_domains_max daily_tabs_opened daily_tabs_opened_max
pre-matching 0.0646358 0.1028066 0.0816237 0.1303186 0.0581926 0.1046028 0.0923660 0.0930137 0.4204475 0.3377663 0.0006247 0.0364702 0.1570896 0.1021035
post-matching 0.0947241 0.0926328 0.0725612 0.0879248 0.0372849 0.0350132 0.0596316 0.0604804 0.3614949 0.2775021 0.0162564 0.0114660 0.0277262 0.0180453

Median

active_hours active_hours_max uri_count uri_count_max search_count search_count_max num_pages num_pages_max daily_max_tabs daily_max_tabs_max daily_unique_domains daily_unique_domains_max daily_tabs_opened daily_tabs_opened_max
pre-matching 0.1259557 0.175981 0.1785714 0.2437811 0.25 0.3333333 0.3768946 0.3679381 0.0855263 0 0.0454545 0.1333333 0.0555556 0.1176471
post-matching 0.1281585 0.127551 0.1351068 0.1603376 0.00 0.0000000 0.2484032 0.2429324 0.0526316 0 0.0316688 0.0769231 0.0847458 0.1052632

metric label active_hours active_hours_max uri_count uri_count_max search_count search_count_max num_pages num_pages_max daily_max_tabs daily_max_tabs_max daily_unique_domains daily_unique_domains_max daily_tabs_opened daily_tabs_opened_max
mean beta 0.7988445 1.471189 146.33392 287.3003 2.324319 5.100206 15614.038 15779.79 9.019717 12.828446 5.148112 8.581990 20.03166 37.29553
mean beta - matched 0.8617226 1.657703 161.98428 336.1304 2.510444 5.829748 20109.092 20310.03 8.297907 12.104231 5.411060 9.465749 18.80546 37.14110
mean release 0.8540465 1.639768 159.33983 330.3512 2.467935 5.696027 17203.011 17398.05 6.349912 9.589452 5.144898 8.906823 17.31211 33.84032
median beta 0.5027778 0.962500 80.50000 152.0000 0.750000 2.000000 3347.500 3513.00 4.125000 6.000000 3.500000 5.200000 8.50000 15.00000
median beta - matched 0.5830440 1.187500 98.30952 199.0000 1.000000 3.000000 6628.833 6828.75 4.000000 6.000000 3.785005 6.000000 9.00000 17.00000
median release 0.5752315 1.168056 98.00000 201.0000 1.000000 3.000000 5372.286 5558.00 3.800000 6.000000 3.666667 6.000000 9.00000 17.00000

Kolmogorov-Smirnov test (KS)

Once again, we use the KS test to check whether there is any significant difference between the average user engagement metrics in the Beta and Release groups across versions (v67 and v68). Reminder: smaller KS values indicate better balance.

KS
active_hours 0.0072743
active_hours_max 0.0092511
uri_count 0.0071471
uri_count_max 0.0092912
search_count 0.0086209
search_count_max 0.0089323
num_pages 0.0433436
num_pages_max 0.0432005
daily_max_tabs 0.0479976
daily_max_tabs_max 0.0436476
daily_unique_domains 0.0178247
daily_unique_domains_max 0.0216601
daily_tabs_opened 0.0179124
daily_tabs_opened_max 0.0189196

Visual inspection

The following violin plots depict the distributions of these subsets:

  • Beta v68: pre-matching
  • Beta v68: matched and subsetted
  • Release v68

NOTE: Guiding lines have been added for the following:

  • black solid: Release mean
  • black dashed: Release median
  • red dashed: subsetted Beta mean


Observations

Our main objective was to determine whether the user engagement metrics changed in the newest Beta version relative to the previous Release version. The density and violin plots show that there are significant differences between the two groups (Beta and Release) in some user engagement metrics, listed as follows.

  • daily_max_tabs
  • daily_max_tabs_max
  • num_pages
  • num_pages_max
  • daily_unique_domains